Copy, please, these files and directories to your personal directory:

cp -r /data/shared/AGE2020/Exercises/E07-RNA_seq/03_exploratory_analysis ~/AGE2020/Exercises/E07-RNA_seq

And switch the R working directory to the current exercise: setwd("~/AGE2020/Exercises/E07-RNA_seq/03_exploratory_analysis")


Introduction

In the previous tutorial we learned how to go from raw RNA-seq reads to count matrix imported to R. We were using a subset of reads from the airway experiment. However, a full version of this experiment is available in the airway package (as the SummarizedExperiment object). We will further work with this object.

Data preparation

Some vector/matrix operations can be speed-up by parallelization. Package BiocParallel offers parallelized versions of functions from the apply family. For example, parallelized version of lapply() is bplapply(). Many functions in DESeq2 can be run parallelized.

We create a BPPARAM variable containing information about parallelization parameters (2 threads). We will reuse this object later. If we call register() with our BPPARAM, some packages (e.g. DESeq2), will then run parallelized automatically, using our BPPARAM.

library(BiocParallel)

BPPARAM <- MulticoreParam(workers = 2)
register(BPPARAM)

Now let’s load the SummarizedExperiment object of the airway data:

source("../../age_library.R")
library(magrittr)
library(airway)
data("airway")
se <- airway
se
## class: RangedSummarizedExperiment 
## dim: 64102 8 
## metadata(1): ''
## assays(1): counts
## rownames(64102): ENSG00000000003 ENSG00000000005 ... LRG_98 LRG_99
## rowData names(0):
## colnames(8): SRR1039508 SRR1039509 ... SRR1039520 SRR1039521
## colData names(9): SampleName cell ... Sample BioSample
colData(se)
## DataFrame with 8 rows and 9 columns
##            SampleName     cell      dex    albut        Run avgLength Experiment    Sample    BioSample
##              <factor> <factor> <factor> <factor>   <factor> <integer>   <factor>  <factor>     <factor>
## SRR1039508 GSM1275862   N61311    untrt    untrt SRR1039508       126  SRX384345 SRS508568 SAMN02422669
## SRR1039509 GSM1275863   N61311      trt    untrt SRR1039509       126  SRX384346 SRS508567 SAMN02422675
## SRR1039512 GSM1275866  N052611    untrt    untrt SRR1039512       126  SRX384349 SRS508571 SAMN02422678
## SRR1039513 GSM1275867  N052611      trt    untrt SRR1039513        87  SRX384350 SRS508572 SAMN02422670
## SRR1039516 GSM1275870  N080611    untrt    untrt SRR1039516       120  SRX384353 SRS508575 SAMN02422682
## SRR1039517 GSM1275871  N080611      trt    untrt SRR1039517       126  SRX384354 SRS508576 SAMN02422673
## SRR1039520 GSM1275874  N061011    untrt    untrt SRR1039520       101  SRX384357 SRS508579 SAMN02422683
## SRR1039521 GSM1275875  N061011      trt    untrt SRR1039521        98  SRX384358 SRS508580 SAMN02422677

So far we have been working with four samples simply divided to untrt and trt groups (trt was dexamethasone). Now there is an additional division of samples by cell line (cell column).

You can see that the se object is an instance of class RangedSummarizedExperiment. In this object there is an additional list (GRangesList) holding genomic coordinates (GRanges) of rows (features - genes) of the count matrix (assays). GRangesList can be accessed by rowRanges() function:

rowRanges(se)
## GRangesList object of length 64102:
## $ENSG00000000003
## GRanges object with 17 ranges and 2 metadata columns:
##        seqnames            ranges strand |   exon_id       exon_name
##           <Rle>         <IRanges>  <Rle> | <integer>     <character>
##    [1]        X 99883667-99884983      - |    667145 ENSE00001459322
##    [2]        X 99885756-99885863      - |    667146 ENSE00000868868
##    [3]        X 99887482-99887565      - |    667147 ENSE00000401072
##    [4]        X 99887538-99887565      - |    667148 ENSE00001849132
##    [5]        X 99888402-99888536      - |    667149 ENSE00003554016
##    ...      ...               ...    ... .       ...             ...
##   [13]        X 99890555-99890743      - |    667156 ENSE00003512331
##   [14]        X 99891188-99891686      - |    667158 ENSE00001886883
##   [15]        X 99891605-99891803      - |    667159 ENSE00001855382
##   [16]        X 99891790-99892101      - |    667160 ENSE00001863395
##   [17]        X 99894942-99894988      - |    667161 ENSE00001828996
##   -------
##   seqinfo: 722 sequences (1 circular) from an unspecified genome
## 
## ...
## <64101 more elements>

You can see that GRangesList has a length of 64102 (= number of rows in se) and its keys are ENSEMBL IDs.

We don’t go deeper now, for more information see ?RangedSummarizedExperiment. I would just note that you can use subset() function to filter features based on GRanges columns (e.g. start, end), and that could be handy.


Let’s go back to our experimental data. We want to be sure that untrt is the reference level for the dex variable:

se$dex <- relevel(se$dex, "untrt")
se$dex
## [1] untrt trt   untrt trt   untrt trt   untrt trt  
## Levels: untrt trt

From se object we can immediately create a DESeqDataSet object:

library(DESeq2)
dds <- DESeqDataSet(se, design = ~ cell + dex)

Let’s annotate genes by using the code from previous tutorial. Note that a custom function for this repeating code would be neat 🙂

library(org.Hs.eg.db)
ann <- AnnotationDbi::select(org.Hs.eg.db, keys = rownames(dds), keytype = "ENSEMBL", columns = c("SYMBOL", "GENENAME"))
ann <- ann[!duplicated(ann$ENSEMBL), ]
rownames(ann) <- ann$ENSEMBL
ann <- ann[rownames(dds), ]
stopifnot(all(rownames(ann) == rownames(dds)))
rowData(dds) <- cbind(rowData(dds), ann)

In RNA-seq, count matrices are usually quite sparse, i.e. some genes have zero counts in almost all samples. We can quickly remove these genes (rows) and consequently speed-up an analysis and reduce the size of the object. We remove all rows having zero or one count across the samples:

nrow(dds)
## [1] 64102
dds <- dds[rowSums(counts(dds)) > 1, ]
nrow(dds)
## [1] 29391

BAM! More than half of the rows disappeared.

We can now prepare the dds object for the next steps. DESeq() function will do several things, shortly:

  • The estimation of size factors, controlling for differences in the counts due varying sequencing depth of the samples. This is further explained in a section below.
  • Fitting a final generalized linear model, which gives estimates of the log-fold changes.
dds <- DESeq(dds, parallel = TRUE)

Count transformations

Count normalization

Some parts of the following text and images are taken from here.

The counts of mapped reads for each gene is proportional to the expression of RNA (“interesting”) in addition to many other factors (“uninteresting”). Normalization is the process of scaling raw count values to account for the “uninteresting” factors. In this way the expression levels are more comparable between and/or within samples.

While normalization is essential for differential expression analyses, it is also necessary for exploratory data analysis, visualization of data, and whenever you are exploring or comparing counts between or within samples. However, do not use normalized counts for differential expression testing. Packages for DE testing (such as DESeq2) expects raw counts and apply custom algorithms for normalization.

The main factors often considered during normalization are:

  • Sequencing depth
Accounting for sequencing depth is necessary for comparison of gene expression between samples. In the example below, each gene appears to have doubled in expression in Sample A relative to Sample B, however this is a consequence of Sample A having double the sequencing depth. NOTE: Each pink and green rectangle represents a read aligned to a gene. Reads connected by dashed lines represents reads spanning an intron.

Accounting for sequencing depth is necessary for comparison of gene expression between samples. In the example below, each gene appears to have doubled in expression in Sample A relative to Sample B, however this is a consequence of Sample A having double the sequencing depth. NOTE: Each pink and green rectangle represents a read aligned to a gene. Reads connected by dashed lines represents reads spanning an intron.

  • Gene length
Accounting for gene length is necessary for comparing expression between different genes within the same sample. In the example, Gene X and Gene Y have similar levels of expression, but the number of reads mapped to Gene X would be many more than the number mapped to Gene Y because Gene X is longer.

Accounting for gene length is necessary for comparing expression between different genes within the same sample. In the example, Gene X and Gene Y have similar levels of expression, but the number of reads mapped to Gene X would be many more than the number mapped to Gene Y because Gene X is longer.

  • RNA composition
A few highly differentially expressed genes between samples, differences in the number of genes expressed between samples, or presence of contamination can skew some types of normalization methods. Accounting for RNA composition is recommended for accurate comparison of expression between samples, and is particularly important when performing differential expression analyses (source). In the example, if we were to divide each sample by the total number of counts to normalize, the counts would be greatly skewed by the DE gene, which takes up most of the counts for Sample A, but not Sample B. Most other genes for Sample A would be divided by the larger number of total counts and appear to be less expressed than those same genes in Sample B.

A few highly differentially expressed genes between samples, differences in the number of genes expressed between samples, or presence of contamination can skew some types of normalization methods. Accounting for RNA composition is recommended for accurate comparison of expression between samples, and is particularly important when performing differential expression analyses (source). In the example, if we were to divide each sample by the total number of counts to normalize, the counts would be greatly skewed by the DE gene, which takes up most of the counts for Sample A, but not Sample B. Most other genes for Sample A would be divided by the larger number of total counts and appear to be less expressed than those same genes in Sample B.

Common normalization methods

Several common normalization methods exist to account for these differences:

Normalization method Description Accounted factors Recommendations for use
CPM/FPM (counts/fragments per million) read counts/fragments scaled by total number of reads/fragments sequencing depth gene count comparisons between replicates of the same samplegroup; NOT for within sample comparisons or DE analysis
TPM (transcripts per kilobase million) counts per length of transcript (kb) per million reads mapped sequencing depth and gene length gene count comparisons within a sample or between samples of the same sample group; NOT for DE analysis
RPKM/FPKM (reads/fragments per kilobase of exon per million reads/fragments mapped) similar to TPM, but the total number of RPKM/FPKM normalized counts for each sample will be different sequencing depth and gene length gene count comparisons between genes within a sample; NOT for between sample comparisons or DE analysis
DESeq2’s median of ratios counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene sequencing depth and RNA composition gene count comparisons between samples and for DE analysis; NOT for within sample comparisons

Between sample comparison of the same gene

For between sample comparison, we have to estimate library size, i.e. sequencing depth.

Counts/fragments per million (CPM/FPM)

Counts/fragments per million (CPM/FPM) mapped reads/fragments are counts (\(C_i\)) scaled by the number of reads/fragments you sequenced (\(N\)) times one million. This unit is related to the FPKM without length normalization and a factor of \(10^3\):

\[CPM_i = \frac{C_i}{\frac{N}{10^6}} = \frac{C_i}{N} \cdot 10^6\]

This is the most simple normalization accounting only for sequencing depth. It is usable only for gene count comparisons between replicates of the same sample group where significant changes in gene expression are not expected.

DESeq2-normalized counts: Median of ratios method

Since tools for differential expression analysis are comparing the counts between sample groups for the same gene, gene length does not need to be accounted for by the tool. However, sequencing depth and RNA composition do need to be taken into account.

To normalize for sequencing depth and RNA composition, DESeq2 uses the median of ratios method. On the user-end there is only one step, but on the back-end there are multiple steps involved, as described below.

NOTE: The steps below describe in detail some of the steps performed by DESeq2 when you run a single function to get DE genes. Basically, for a typical RNA-seq analysis, you would not run these steps individually.

Step 1: create a pseudo-reference sample (row-wise geometric mean)

For each gene, a pseudo-reference sample is created that is equal to the geometric mean across all samples.

gene sampleA sampleB pseudo-reference sample
EF2A 1489 906 sqrt(1489 * 906) = 1161.5
ABCD1 22 13 sqrt(22 * 13) = 17.7

Step 2: calculates ratio of each sample to the reference

For every gene in a sample, the ratios (sample/pseudo-reference) are calculated (as shown below). This is performed for each sample in the dataset. Since the majority of genes are not differentially expressed, the majority of genes in each sample should have similar ratios within the sample.

gene sampleA sampleB pseudo-reference sample ratio of sampleA/ref ratio of sampleB/ref
EF2A 1489 906 1161.5 1489/1161.5 = 1.28 906/1161.5 = 0.78
ABCD1 22 13 16.9 22/16.9 = 1.30 13/16.9 = 0.77
MEFV 793 410 570.2 793/570.2 = 1.39 410/570.2 = 0.72
BAG1 76 42 56.5 76/56.5 = 1.35 42/56.5 = 0.74
MOV10 521 1196 883.7 521/883.7 = 0.590 1196/883.7 = 1.35

Step 3: calculate the normalization factor for each sample (size factor)

The median value (column-wise for the above table) of all ratios for a given sample is taken as the normalization factor (size factor) for that sample, as calculated below. Notice that the differentially expressed genes should not affect the median value:

normalization_factor_sampleA <- median(c(1.28, 1.3, 1.39, 1.35, 0.59))

normalization_factor_sampleB <- median(c(0.78, 0.77, 0.72, 0.74, 1.35))

The median of ratios method makes the assumption that not ALL genes are differentially expressed; therefore, the normalization factors should account for sequencing depth and RNA composition of the sample (large outlier genes will not represent the median ratio values). This method is robust to imbalance in up-/down-regulation and large numbers of differentially expressed genes.

NOTE: Usually these size factors are around 1, if you see large variations between samples it is important to take note since it might indicate the presence of extreme outliers.

Step 4: calculate the normalized count values using the normalization factor

This is performed by dividing each raw count value in a given sample by that sample’s normalization factor to generate normalized count values. This is performed for all count values (every gene in every sample). For example, if the median ratio for SampleA was 1.3 and the median ratio for SampleB was 0.77, you could calculate normalized counts as follows:

SampleA median ratio = 1.3

SampleB median ratio = 0.77

Raw Counts

gene sampleA sampleB
EF2A 1489 906
ABCD1 22 13

Normalized counts

gene sampleA sampleB
EF2A 1489 / 1.3 = 1145.39 906 / 0.77 = 1176.62
ABCD1 22 / 1.3 = 16.92 13 / 0.77 = 16.88

NOTE: Please note that normalized count values are not whole numbers.

Normalized counts can be retrieved by:

counts(dds, normalized = TRUE) %>% head()
##                 SRR1039508 SRR1039509  SRR1039512 SRR1039513   SRR1039516 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003  663.31418  499.90698  740.152838  608.90630  966.3137310  748.37220  836.24873  605.60240
## ENSG00000000419  456.21167  574.66985  526.500473  544.73235  498.4412655  571.10734  452.87756  537.84269
## ENSG00000000457  253.99365  235.44726  222.978461  244.75646  208.0376662  236.59140  253.04669  242.45271
## ENSG00000000460   58.61392   61.37251   33.913074   52.23461   66.2323998   45.03099   82.53884   63.52473
## ENSG00000000938    0.00000    0.00000    1.695654    0.00000    0.8491333    0.00000    0.00000    0.00000
## ENSG00000000971 3175.89749 4105.26291 5237.026440 6345.75883 5707.0251194 7881.85315 5621.32909 8464.66992

Within/between sample comparison

For between gene comparison, we have to estimate both library size and “gene size”, i.e. some measure of transcript length. The simplest method is to take the median of transcript lengths, which is calculated as the sum (in base pairs) of each transcript’s exons. More sophisticated methods/tools such as Salmon, kallisto or Sailfish are using “effective length”, which takes into account the fact that not every transcript in the population can produce a fragment of every length starting at every position.

TPM

Taken from this great blog post:

TPM is very similar to RPKM and FPKM. The only difference is the order of operations. Here’s how you calculate TPM:

  1. Divide the read counts (\(C_i\)) by the length of each gene in kilobases (\(L_i\)). This gives you reads per kilobase (RPK).
  2. Count up all the RPK values in a sample and divide this number by 1 000 000. This is your “per million” scaling factor.
  3. Divide the RPK value by the “per million” scaling factor. This gives you TPM.

\[TPM_i = \frac{C_i}{L_i} \cdot \frac{10^6}{\sum{\frac{C_i}{L_i}}}\]

So you see, when calculating TPM, the only difference from FPKM is that you normalize for gene length first, and then normalize for sequencing depth second. However, the effects of this difference are quite profound.

When you use TPM, the sum of all TPMs in each sample are the same (\(\sum{\frac{C_i}{L_i}}\)). This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads (\(N\)) in each sample may be different, and this makes it harder to compare samples directly.

Here’s an example. If the TPM for gene A in Sample 1 is 3.33 and the TPM in sample B is 3.33, then I know that the exact same proportion of total reads mapped to gene A in both samples. This is because the sum of the TPMs in both samples always add up to the same number (so the denominator required to calculate the proportions is the same, regardless of what sample you are looking at.)

With RPKM or FPKM, the sum of normalized reads in each sample can be different. Thus, if the RPKM for gene A in Sample 1 is 3.33 and the RPKM in Sample 2 is 3.33, I would not know if the same proportion of reads in Sample 1 mapped to gene A as in Sample 2. This is because the denominator required to calculate the proportion could be different for the two samples.

While TPM and RPKM/FPKM normalization methods both account for sequencing depth and gene length, RPKM/FPKM are not recommended. The reason is that the normalized count values output by the RPKM/FPKM method are not comparable between samples.

Using RPKM/FPKM normalization, the total number of RPKM/FPKM normalized counts for each sample will be different. Therefore, you cannot compare the normalized counts for each gene equally between samples.

Deeper explanation of FPKM and TPM problems can be found here.

Calculating TPM in R

To calculate TPM, we need to know gene lengths. Probably the easiest way is to calculate a median of transcript lengths of each gene. We can use a package GenomicFeatures, which includes a special class TxDb holding information about genomic ranges. This object can be created from GTF or GFF file.

library(GenomicFeatures)

txdb <- makeTxDbFromGFF("../_data/genome/Homo_sapiens.GRCh37.75.gtf", format = "gtf")

# For each gene we get a GRanges object of its transcripts.
# This object holds genomic range of each transcript.
txs_by_gene <- transcriptsBy(txdb, "gene")
# We only want genes which are both in our DESeqDataSet and txs_by_gene.
common_genes <- intersect(rownames(dds), names(txs_by_gene))
dds <- dds[common_genes, ]
txs_by_gene <- txs_by_gene[common_genes]

head(txs_by_gene)
## GRangesList object of length 6:
## $ENSG00000000003
## GRanges object with 3 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id         tx_name
##          <Rle>         <IRanges>  <Rle> | <integer>     <character>
##   [1]        X 99883667-99891803      - |    194231 ENST00000373020
##   [2]        X 99887538-99891686      - |    194232 ENST00000496771
##   [3]        X 99888439-99894988      - |    194233 ENST00000494424
##   -------
##   seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
## 
## $ENSG00000000419
## GRanges object with 7 ranges and 2 metadata columns:
##       seqnames            ranges strand |     tx_id         tx_name
##          <Rle>         <IRanges>  <Rle> | <integer>     <character>
##   [1]       20 49551404-49575087      - |    182563 ENST00000371588
##   [2]       20 49551404-49575087      - |    182564 ENST00000466152
##   [3]       20 49551404-49575092      - |    182565 ENST00000371582
##   [4]       20 49551433-49562398      - |    182566 ENST00000494752
##   [5]       20 49551482-49575058      - |    182567 ENST00000371584
##   [6]       20 49551490-49575069      - |    182568 ENST00000371583
##   [7]       20 49552685-49575069      - |    182569 ENST00000413082
##   -------
##   seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
## 
## $ENSG00000000457
## GRanges object with 5 ranges and 2 metadata columns:
##       seqnames              ranges strand |     tx_id         tx_name
##          <Rle>           <IRanges>  <Rle> | <integer>     <character>
##   [1]        1 169818772-169863093      - |     15488 ENST00000367771
##   [2]        1 169821804-169863076      - |     15489 ENST00000367772
##   [3]        1 169822215-169858029      - |     15490 ENST00000367770
##   [4]        1 169823652-169863408      - |     15491 ENST00000423670
##   [5]        1 169828260-169863093      - |     15492 ENST00000470238
##   -------
##   seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
## 
## ...
## <3 more elements>
# We are only interested in transcript names.
# For each gene we get a list of its transcript names.
txs_by_gene <- lapply(txs_by_gene, function(x) {
    mcols(x, use.names = TRUE)[, "tx_name"]
})

# We can use the parallelized version of lapply(), but it sometimes causes errors.
# txs_by_gene <- bplapply(txs_by_gene, function(x) {
#   mcols(x, use.names = TRUE)[, "tx_name"]
# }, BPPARAM = BPPARAM)

head(txs_by_gene)
## $ENSG00000000003
## [1] "ENST00000373020" "ENST00000496771" "ENST00000494424"
## 
## $ENSG00000000419
## [1] "ENST00000371588" "ENST00000466152" "ENST00000371582" "ENST00000494752" "ENST00000371584" "ENST00000371583" "ENST00000413082"
## 
## $ENSG00000000457
## [1] "ENST00000367771" "ENST00000367772" "ENST00000367770" "ENST00000423670" "ENST00000470238"
## 
## $ENSG00000000460
##  [1] "ENST00000498289" "ENST00000472795" "ENST00000413811" "ENST00000496973" "ENST00000481744" "ENST00000359326" "ENST00000466580" "ENST00000456684"
##  [9] "ENST00000459772" "ENST00000286031"
## 
## $ENSG00000000938
## [1] "ENST00000374005" "ENST00000399173" "ENST00000545953" "ENST00000374004" "ENST00000374003" "ENST00000457296" "ENST00000475472" "ENST00000468038"
## 
## $ENSG00000000971
## [1] "ENST00000439155" "ENST00000367429" "ENST00000496761" "ENST00000359637" "ENST00000466229" "ENST00000470918"
# For each transcript we get a GRanges object of its exons.
exons_by_tx <- exonsBy(x = txdb, by = "tx", use.names = TRUE)

head(exons_by_tx)
## GRangesList object of length 6:
## $ENST00000456328
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id       exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer>     <character> <integer>
##   [1]        1 11869-12227      + |         1 ENSE00002234944         1
##   [2]        1 12613-12721      + |         8 ENSE00003582793         2
##   [3]        1 13221-14409      + |        12 ENSE00002312635         3
##   -------
##   seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
## 
## $ENST00000515242
## GRanges object with 3 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id       exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer>     <character> <integer>
##   [1]        1 11872-12227      + |         2 ENSE00002234632         1
##   [2]        1 12613-12721      + |         9 ENSE00003608237         2
##   [3]        1 13225-14412      + |        13 ENSE00002306041         3
##   -------
##   seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
## 
## $ENST00000518655
## GRanges object with 4 ranges and 3 metadata columns:
##       seqnames      ranges strand |   exon_id       exon_name exon_rank
##          <Rle>   <IRanges>  <Rle> | <integer>     <character> <integer>
##   [1]        1 11874-12227      + |         3 ENSE00002269724         1
##   [2]        1 12595-12721      + |         6 ENSE00002270865         2
##   [3]        1 13403-13655      + |        14 ENSE00002216795         3
##   [4]        1 13661-14409      + |        16 ENSE00002303382         4
##   -------
##   seqinfo: 265 sequences (1 circular) from an unspecified genome; no seqlengths
## 
## ...
## <3 more elements>
# For each transcript we get a sum of lengths of its exons.
tx_lengths <- sum(width(exons_by_tx))

head(tx_lengths)
## ENST00000456328 ENST00000515242 ENST00000518655 ENST00000450305 ENST00000473358 ENST00000469289 
##            1657            1653            1483             632             712             535
# For each gene we get a median length of its transcripts.
gene_lengths <- lapply(txs_by_gene, function(x) median(tx_lengths[x])) %>% unlist()
# gene_lengths <- bplapply(txs_by_gene, function(x) median(tx_lengths[x]), BPPARAM = BPPARAM) %>% unlist()

# We sort gene lengths by the order of dds rows.
gene_lengths <- gene_lengths[rownames(dds)]

head(gene_lengths)
## ENSG00000000003 ENSG00000000419 ENSG00000000457 ENSG00000000460 ENSG00000000938 ENSG00000000971 
##          1025.0          1073.0          2916.0          1384.0          2131.5          1638.0
stopifnot(all(rownames(dds) == names(gene_lengths)))

Now we can use the gene_lengths to calculate the TPM values. Following code is just a matrix version of the TPM formula:

# First we calculate Ci/Li from the TPM formula.
x <- counts(dds) / gene_lengths
tpm <- t(t(x) * 1e6 / colSums(x))
head(tpm)
##                 SRR1039508 SRR1039509   SRR1039512 SRR1039513   SRR1039516 SRR1039517 SRR1039520 SRR1039521
## ENSG00000000003  29.998004  21.929473  32.29820620  25.184935  43.35055907  31.904105  37.909764  25.222508
## ENSG00000000419  19.708955  24.081390  21.94723949  21.522746  21.36066089  23.257916  19.611940  21.398341
## ENSG00000000457   4.037684   3.630521   3.42023741   3.558450   3.28061201   3.545392   4.032298   3.549475
## ENSG00000000460   1.963187   1.993887   1.09600346   1.600061   2.20056813   1.421766   2.771161   1.959437
## ENSG00000000938   0.000000   0.000000   0.03558219   0.000000   0.01831854   0.000000   0.000000   0.000000
## ENSG00000000971  89.877196 112.691182 143.00516667 164.241872 160.21246560 210.265115 159.464710 220.607704

Because the TPM matrix has the same dimension, rownames and colnames as the count matrix in dds, we can add it to assays:

assay(dds, "tpm") <- tpm

Removing the mean-variance trend - log2/vst/rlog transformation

Taken from here, shortened:

Many common statistical methods for exploratory analysis of multidimensional data, for example clustering and principal components analysis (PCA), work best for data that generally has the same range of variance at different ranges of the mean values. For RNA-seq raw counts, however, the variance grows with the mean. For example, if one performs PCA directly on a matrix of size-factor-normalized read counts, the result typically depends only on the few most strongly expressed genes because they show the largest absolute differences between samples. A simple and often used strategy to avoid this is to take the logarithm of the normalized count values plus a small pseudocount; however, now the genes with the very lowest counts will tend to dominate the results because, due to the strong Poisson noise inherent to small count values, and the fact that the logarithm amplifies differences for the smallest values, these low count genes will show the strongest relative differences between samples.

As a solution, DESeq2 offers transformations for count data that stabilize the variance across the mean. One such transformation is the regularized-logarithm transformation or rlog2. For genes with high counts, the rlog transformation will give similar result to the ordinary log2 transformation of normalized counts. For genes with lower counts, however, the values are shrunken towards the genes’ averages across all samples. The rlog-transformed data can be used directly for computing distances between samples and making PCA plots.

Another transformation that similarly improves distance calculation across samples, the variance stabilizing transformation implemented in the vst() function, is discussed alongside the rlog in the DESeq2 vignette.

In rlog() function we specify blind = FALSE, which means that differences across donors and treatment should not add to the variance-mean profile of the experiment.

rld <- rlog(dds, blind = FALSE)

rld is object of class DESeqTransform. Counts are retrieved by assay() function. Later we will use the rld object for exploratory analysis, but not for differential expression testing. That’s why we have created an another object than dds.

We can now look what effect different transformations have. In the vsn package there is an useful function meanSdPlot() which plots mean and standard deviation of each gene across the samples.

library(ggplot2)

vsn::meanSdPlot(counts(dds, normalized = TRUE) + 1, plot = FALSE, rank = FALSE)$gg + ggtitle("Normalized counts")

vsn::meanSdPlot((counts(dds, normalized = TRUE) + 1) %>% log2(), plot = FALSE, rank = FALSE)$gg + ggtitle("log2(normalized counts + 1)")

vsn::meanSdPlot(assay(rld), plot = FALSE, rank = FALSE)$gg + ggtitle("rlog-normalized counts")

You can see that in normalized, but untransformed counts, genes with the highest mean also have the highest standard deviation. This is the mean-variance trend in RNA-seq data.

Log2 transformation makes it better, but now counts with the smallest values dominate in standard deviation, because log2 amplified their values the most.

Rlog transformation seems to be the best, as it correctly shrunk genes with low counts to average across all samples.

We can also look at a sample-to-sample count plot:

par(mfrow = c(1, 3))
plot(counts(dds, normalized = TRUE)[, 1:2] + 1, pch = 16, cex = 0.3)
plot((counts(dds, normalized = TRUE)[, 1:2] + 1) %>% log2(), pch = 16, cex = 0.3)
plot(assay(rld)[, 1:2], pch = 16, cex = 0.3)

Again, the mean-variance trend is visible in untransformed counts. Log2 transformation makes it better, but many low count genes now show the strongest relative differences between samples. Rlog correctly transforms them, as they mean across all samples is very low and they are just a technical noise.


Exploratory analysis

Exploratory analysis will help us to assess the main characteristics of our data. We can check if our experiment makes sense at biological level and if it was done correctly at technical level. Don’t forget to use the rlog-transformed data (rld object).

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a technique used to emphasize variation and bring out strong patterns in a dataset (dimensionality reduction).

DESeq2 package has a plotPCA() function for that. You can specify the parameter intgroup to color points by column from the sample sheet.

plotPCA(rld, intgroup = "dex")

plotPCA(rld, intgroup = "cell")

It seems that samples are correctly divided by treatment. There is an evident difference between N080611 (blue) and other cell lines.

Note that plotPCA() is returning a ggplot2 object, which can be saved and further modified:

p <- plotPCA(rld, intgroup = "dex")
p + ggtitle("PCA plot") + theme_bw()

You can also only return a dataframe with principal components and add more aesthetics to the plot:

p <- plotPCA(rld, intgroup = "dex", returnData = TRUE)
head(p)
p <- data.frame(p, cell = rld$cell, SampleName = rld$SampleName)
head(p)
ggplot(p, aes(x = PC1, y = PC2)) +
  # We want to show points, color them by cell and shape them by dex.
  geom_point(aes(color = cell, shape = dex), size = 5) +
  # One unit of x is equal to one unit of y.
  coord_fixed() +
  ggtitle("PCA plot") +
  # Annotate points.
  ggrepel::geom_text_repel(aes(label = SampleName)) +
  # Black and white theme.
  theme_bw()

Hierarchical clustering

Hierarchical clustering is clustering samples by their (Euclidean) distance. Heatmap is used to show how samples are clustered (dendrogram) and what is their distance from each other (color).

We will use a package heatmaply, which offers an interactive heatmap. But first we need to calculate a distance matrix of the samples.

sample_dists <- assay(rld) %>% t() %>% dist() %>% as.matrix()
head(sample_dists)
##            SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
## SRR1039508    0.00000   45.70156   39.25598   62.63597   44.50960   64.49866   39.58025   63.36519
## SRR1039509   45.70156    0.00000   54.91227   44.53083   59.06801   51.45301   57.46653   45.06067
## SRR1039512   39.25598   54.91227    0.00000   48.72886   43.58240   59.23395   36.74745   57.87995
## SRR1039513   62.63597   44.53083   48.72886    0.00000   63.74705   49.88392   58.49396   36.49798
## SRR1039516   44.50960   59.06801   43.58240   63.74705    0.00000   47.48508   46.41177   65.55040
## SRR1039517   64.49866   51.45301   59.23395   49.88392   47.48508    0.00000   63.60393   52.32102
library(heatmaply)

# We have to be sure that column order of the distance matrix is the same as in sample sheet.
treatment <- colData(dds)[colnames(sample_dists), "dex"]
treatment
## [1] untrt trt   untrt trt   untrt trt   untrt trt  
## Levels: untrt trt
cell_line <- colData(dds)[colnames(sample_dists), "cell"]
cell_line
## [1] N61311  N61311  N052611 N052611 N080611 N080611 N061011 N061011
## Levels: N052611 N061011 N080611 N61311
heatmaply(sample_dists, col_side_colors = treatment, row_side_colors = cell_line)

You can see that clustering of the samples has an interpretable biological meaning - they cluster by the treatment.

Instead of the distance matrix we can also calculate a sample correlation matrix:

sample_cor <- assay(rld) %>% cor()
sample_cor
##            SRR1039508 SRR1039509 SRR1039512 SRR1039513 SRR1039516 SRR1039517 SRR1039520 SRR1039521
## SRR1039508  1.0000000  0.9982056  0.9986694  0.9966427  0.9982853  0.9964280  0.9986452  0.9965726
## SRR1039509  0.9982056  1.0000000  0.9974051  0.9983024  0.9969986  0.9977279  0.9971577  0.9982657
## SRR1039512  0.9986694  0.9974051  1.0000000  0.9979696  0.9983604  0.9969871  0.9988328  0.9971379
## SRR1039513  0.9966427  0.9983024  0.9979696  1.0000000  0.9965235  0.9978695  0.9970713  0.9988612
## SRR1039516  0.9982853  0.9969986  0.9983604  0.9965235  1.0000000  0.9980705  0.9981374  0.9963322
## SRR1039517  0.9964280  0.9977279  0.9969871  0.9978695  0.9980705  1.0000000  0.9965255  0.9976596
## SRR1039520  0.9986452  0.9971577  0.9988328  0.9970713  0.9981374  0.9965255  1.0000000  0.9978600
## SRR1039521  0.9965726  0.9982657  0.9971379  0.9988612  0.9963322  0.9976596  0.9978600  1.0000000
# Order im correlation matrix could be different, so better is to get the coloring values again.
treatment <- colData(dds)[colnames(sample_dists), "dex"]
cell_line <- colData(dds)[colnames(sample_dists), "cell"]

heatmaply(sample_cor, col_side_colors = treatment, row_side_colors = cell_line)

All samples show pretty high correlations and this suggests there are no outlying samples. Again, they cluster by treatment.

Together, these plots suggest to us that the data make biologically sense and we can safely proceed to the differential expression analysis.

Boxplots

Boxplots are a great way to graphically show differences between groups. Rlog-transformed data are used for this purpose. However, always use a proper statistical model (in our case DESeq2) to test for differential expression. Boxplots are great for exploring, but not objective for assessing the differences.

We will use ggboxplot() function from the ggpubr package. This package provides several useful publication-ready ggplot plots.

However, ggplot is generally aimed on plotting data in long format, where each observation (i.e. gene, counts, sample ID, sample group, etc.) has its own row. Our data (count matrix) is in wide format: each value has only gene and sample ID assigned.

We will use the package tidyr to convert count matrix in wide format to long format.

NOTE: Long format consumes much more memory.

count_matrix <- assay(rld)
# First we add gene annotation from rowData.
count_matrix <- cbind(count_matrix, rowData(rld)[, c("SYMBOL", "GENENAME")]) %>% as.data.frame()
# tidyr::pivot_longer() function is converting wide data to the long format.
# key = name of column in long data with column names of wide data
# value = name of column in long data with values of wide data
# The last arguments indicate columns in wide data which don't contain values.
long_data <- tidyr::pivot_longer(
  count_matrix,
  # We want to convert ("expand") these columns to the long format.
  -c(SYMBOL, GENENAME),
  # We want to add observation IDs (colnames) to this column.
  names_to = "sample_id",
  # We want to add observation values (colnames) to this column.
  values_to = "rlog_count"
)

# We can remove the temporary count matrix to free memory.
rm(count_matrix)

head(long_data)

As you can see, for each sample and gene we have its counts and annotation on separated row. But we are still missing the data from the sample sheet. To do it, we will join our long data with the sample sheet, using the sample_id column. left_join(x, y) function from dplyr package is doing that:

Return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.

Visual example of joining two tables. Table 1 and Table 2 could be our long data and sample sheet. id column is similar to our Sample_ID column. In the resulting table on bottom, to Table 1 there are added rows and columns of the rows whose id matches in Table 2.

Visual example of joining two tables. Table 1 and Table 2 could be our long data and sample sheet. id column is similar to our Sample_ID column. In the resulting table on bottom, to Table 1 there are added rows and columns of the rows whose id matches in Table 2.

sample_sheet <- colData(rld) %>% as.data.frame()
sample_sheet$sample_id <- rownames(sample_sheet)
long_data <- dplyr::left_join(long_data, sample_sheet, by = "sample_id")
# We will also remove rows with NA gene symbol.
long_data <- long_data[!is.na(long_data$SYMBOL), ]
head(long_data)

Now our long data are complete and we can proceed to boxplots. Note that we want to work with single genes, so we have to filter our long data. First we will plot boxplots of cell lines and treatments:

plot_boxplot_ggplot2(
  long_data[long_data$SYMBOL %in% c("CCND1", "CCND2", "CCND3"), ],
  x = "dex",
  y = "rlog_count",
  feature_col = "SYMBOL",
  color_by = "dex",
  main = "Boxplot of rlog transformed counts",
  x_lab = "Sample group",
  y_lab = "rlog(counts)",
  do_t_test = FALSE
)

plot_boxplot_ggplot2(
  long_data[long_data$SYMBOL == "CCND1", ],
  x = "cell",
  y = "rlog_count",
  feature_col = "SYMBOL",
  feature = "CCND1",
  color_by = "cell",
  facet_by = "dex",
  main = "Boxplot of rlog transformed counts",
  x_lab = "Cell line",
  y_lab = "rlog(counts)",
  do_t_test = FALSE
)

However, our data contains very few samples in each group, so in this case boxplots don’t make much sense.


Finishing up

It’s a good practice to include all warnings (warnings()) and errors (traceback()), and also information about your environment - R version, libraries, etc. (sessionInfo()) - on the end of your script. We also save our current variables to a single file from which we can later restore them.

save.image("03_exploratory_analysis.RData")

warnings()
traceback()
## No traceback available
sessionInfo()
## R version 3.6.3 (2020-02-29)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Debian GNU/Linux 9 (stretch)
## 
## Matrix products: default
## BLAS:   /usr/lib/libblas/libblas.so.3.7.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.7.0
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8   
##  [6] LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] friendlyeval_0.1.1          heatmaply_1.1.0             viridis_0.5.1               viridisLite_0.3.0           plotly_4.9.2.1             
##  [6] ggplot2_3.3.0               GenomicFeatures_1.38.2      org.Hs.eg.db_3.10.0         AnnotationDbi_1.48.0        DESeq2_1.26.0              
## [11] airway_1.6.0                SummarizedExperiment_1.16.1 DelayedArray_0.12.3         matrixStats_0.56.0          Biobase_2.46.0             
## [16] GenomicRanges_1.38.0        GenomeInfoDb_1.22.1         IRanges_2.20.2              S4Vectors_0.24.4            BiocGenerics_0.32.0        
## [21] magrittr_1.5                BiocParallel_1.20.1         emo_0.0.0.9000              glue_1.4.0                  knitr_1.28                 
## [26] rmarkdown_2.1              
## 
## loaded via a namespace (and not attached):
##   [1] backports_1.1.6          Hmisc_4.4-0              BiocFileCache_1.10.2     plyr_1.8.6               lazyeval_0.2.2           splines_3.6.3           
##   [7] crosstalk_1.1.0.1        usethis_1.6.1            digest_0.6.25            foreach_1.5.0            htmltools_0.4.0          gdata_2.18.0            
##  [13] fansi_0.4.1              checkmate_2.0.0          memoise_1.1.0            cluster_2.1.0            gclus_1.3.2              limma_3.42.2            
##  [19] remotes_2.1.1            Biostrings_2.54.0        annotate_1.64.0          askpass_1.1              prettyunits_1.1.1        jpeg_0.1-8.1            
##  [25] colorspace_1.4-1         blob_1.2.1               rappdirs_0.3.1           ggrepel_0.8.2            xfun_0.13                dplyr_0.8.5             
##  [31] callr_3.4.3              crayon_1.3.4             RCurl_1.98-1.2           jsonlite_1.6.1           hexbin_1.28.1            genefilter_1.68.0       
##  [37] iterators_1.0.12         survival_3.1-12          registry_0.5-1           gtable_0.3.0             zlibbioc_1.32.0          XVector_0.26.0          
##  [43] webshot_0.5.2            pkgbuild_1.0.7           scales_1.1.0             vsn_3.54.0               DBI_1.1.0                Rcpp_1.0.4.6            
##  [49] xtable_1.8-4             progress_1.2.2           htmlTable_1.13.3         foreign_0.8-76           bit_1.1-15.2             preprocessCore_1.48.0   
##  [55] Formula_1.2-3            htmlwidgets_1.5.1        httr_1.4.1               gplots_3.0.3             RColorBrewer_1.1-2       acepack_1.4.1           
##  [61] ellipsis_0.3.0           pkgconfig_2.0.3          XML_3.99-0.3             farver_2.0.3             nnet_7.3-14              dbplyr_1.4.3            
##  [67] locfit_1.5-9.4           reshape2_1.4.4           tidyselect_1.0.0         labeling_0.3             rlang_0.4.5              munsell_0.5.0           
##  [73] tools_3.6.3              cli_2.0.2                generics_0.0.2           RSQLite_2.2.0            devtools_2.3.0           evaluate_0.14           
##  [79] stringr_1.4.0            yaml_2.2.1               processx_3.4.2           bit64_0.9-7              fs_1.4.1                 caTools_1.18.0          
##  [85] purrr_0.3.4              dendextend_1.13.4        biomaRt_2.42.1           compiler_3.6.3           rstudioapi_0.11          curl_4.3                
##  [91] png_0.1-7                ggsignif_0.6.0           testthat_2.3.2           affyio_1.56.0            tibble_3.0.1             geneplotter_1.64.0      
##  [97] stringi_1.4.6            ps_1.3.2                 desc_1.2.0               lattice_0.20-41          Matrix_1.2-18            vctrs_0.2.4             
## [103] pillar_1.4.3             lifecycle_0.2.0          BiocManager_1.30.10      data.table_1.12.8        bitops_1.0-6             seriation_1.2-8         
## [109] rtracklayer_1.46.0       R6_2.4.1                 latticeExtra_0.6-29      affy_1.64.0              TSP_1.1-10               KernSmooth_2.23-17      
## [115] gridExtra_2.3            codetools_0.2-16         sessioninfo_1.1.1        gtools_3.8.2             MASS_7.3-51.6            assertthat_0.2.1        
## [121] pkgload_1.0.2            openssl_1.4.1            rprojroot_1.3-2          withr_2.2.0              GenomicAlignments_1.22.1 Rsamtools_2.2.3         
## [127] GenomeInfoDbData_1.2.2   hms_0.5.3                grid_3.6.3               rpart_4.1-15             tidyr_1.0.2              ggpubr_0.2.5            
## [133] lubridate_1.7.8          base64enc_0.1-3

HTML rendering

This chunk is not evaluated (eval = FALSE). Otherwise you will probably end up in recursive hell 🤯

library(rmarkdown)
library(knitr)
library(glue)

# You can set global chunk options. Options set in individual chunks will override this.
opts_chunk$set(warning = FALSE, message = FALSE)
render("03_exploratory_analysis.Rmd", output_file = "03_exploratory_analysis.html")